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Abstract 

Background: The advent of genome-wide association studies has led to many novel disease-SNP associations, 
opening the door to focused study on their biological underpinnings. Because of the importance of analyzing these 
associations, numerous statistical methods have been devoted to them. However, fewer methods have attempted to 
associate entire genes or genomic regions with outcomes, which is potentially more useful knowledge from a 
biological perspective and those methods currently implemented are often permutation-based. 

Results: One property of some permutation-based tests is that their power varies as a function of whether significant 
markers are in regions of linkage disequilibrium (LD) or not, which we show from a theoretical perspective. We 
therefore develop two methods for quantifying the degree of association between a genomic region and outcome, 
both of whose power does not vary as a function of LD structure. One method uses dimension reduction to "filter" 
redundant information when significant LD exists in the region, while the other, called the summary-statistic test, 
controls for LD by scaling marker Z-statistics using knowledge of the correlation matrix of markers. An advantage of 
this latter test is that it does not require the original data, but only their Z-statistics from univariate regressions and an 
estimate of the correlation structure of markers, and we show how to modify the test to protect the type 1 error rate 
when the correlation structure of markers is misspecified. We apply these methods to sequence data of oral cleft and 
compare our results to previously proposed gene tests, in particular permutation-based ones. We evaluate the 
versatility of the modification of the summary-statistic test since the specification of correlation structure between 
markers can be inaccurate. 

Conclusion: We find a significant association in the sequence data between the 8q24 region and oral cleft using our 
dimension reduction approach and a borderline significant association using the summary-statistic based approach. 
We also implement the summary-statistic test using Z-statistics from an already-published GWAS of Chronic 
Obstructive Pulmonary Disorder (COPD) and correlation structure obtained from HapMap. We experiment with the 
modification of this test because the correlation structure is assumed imperfectly known. 
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Background 

The focus in genetic association studies has been on 
uncovering loci that are risk factors for an outcome, be 
it binary or continuous, or markers in linkage disequilib- 
rium (LD) with those causal loci. Increasingly, however, 
gene-based tests are coming to the forefront, especially 
as sequencing technologies mature and grow cheaper 
[1-3]. Gene-based tests are useful to provide insight into 
whether a region of the genome has a significant asso- 
ciation with some outcome and for inter-gene signif- 
icance comparisons, despite differences in the size of 
genes [4,5]. Development of such tests is difficult, how- 
ever, as markers are usually correlated with one another 
and have highly variable minor allele frequencies [6] . As 
a result, tests have often been born more out of prac- 
ticality or computational ease. Some gene-based tests 
take the smallest p-value over all the markers in the 
region. Others, such as that implemented in PLINK, take 
a more sophisticated approach, converting p-values of 
markers to x 2 test statistics, averaging those tests statis- 
tics, then comparing it to a null distribution generated 
from permutations of the outcome under the null [1,7]. 
Liu et al. (2010) use a similar, though more efficient 
method, in which they again convert marker p-values 
to X\ test statistics, take the sum of those test statis- 
tics, then generate a null distribution by sampling from 
sums of correlated X\ random variables. Both approaches 
are intuitive and valuable ways to assess gene signifi- 
cance, though in both cases the power for detection of 
a gene becomes not only a function of the effect size of 
the individual markers, but the degree to which mark- 
ers are in LD with one another. For example, assuming 
only one marker in the region has a truly non-zero effect 
size, power for detecting that effect will be higher if the 
marker is in high LD with other markers than if it is 
independent of them. Moskvina et al. (2012) indepen- 
dently made this observation, having noticed that the 
significance of regions they tested changed according to 
how much they pruned markers in high LD with one 
another [8]. 

One way to think about why this phenomenon occurs 
is that, rather than transforming the test statistic so that 
markers highly correlated with one another "mean less" 
because they do not contribute independent information, 
they transform the null distribution for certain markers 
under the null to "mean more." As a result, the type 1 
error is maintained, but power varies as a function of the 
correlation between the marker and surrounding mark- 
ers. Intuitively, this is not a desirable property because 
it will lead to a systematic under-detection of those loci 
that happen to be independent of proximal markers even 
though they are inherently no less important in predicting 
the outcome. This issue becomes particularly problem- 
atic for sequence data since there would generally be 



even more correlation. However, the issue is a zero-sum 
trade-off; what results in less power for detection of sin- 
gle nucleotide polymorphisms (SNPs) in low LD trans- 
lates to more power for detection of SNPs in high LD. 
Though, if there is an underlying common function or 
characteristic of those genomic regions whose signifi- 
cant SNPs are not in high LD, perhaps due to when 
they first occurred in evolutionary history, such regions 
will likely be missed in association analyses so that 
potentially key regions will not be studied in greater 
depth. As a result of this shortcoming, which may be 
more or less important depending on the specific LD 
structure of the genomic region under study, we pro- 
pose two methods, one of which transforms summary 
Z-statistics from univariate regressions of markers so 
that it follows a standard parametric distribution under 
the null hypothesis and power does not vary with the 
LD structure, and the second of which uses an Eigen 
decomposition of the information matrix to find the 
"effective" amount of information in the region and 
increases power by performing a more parsimonious test. 
If the information matrix is evaluated under the null, 
this latter test is essentially a dimension-reduced score 
test analogue to a method described in [2,3], which finds 
the principal components of the data matrix. Specif- 
ically, for the first approach we propose, we find Z- 
statistics associated with each marker in our region and 
the correlation matrix of the markers and perform a 
X 2 test, an approach similar to that proposed by Yang 
et al. (2012) [9]. In case the correlation matrix is imper- 
fectly known, we propose a modification of this test that 
adjusts the correlation structure to protect the type 1 
error. In the latter approach, we calculate the eigenvec- 
tors associated with the information matrix to obtain 
a most powerful linear combination of the scores, on 
which we again perform a x 2 test after having nor- 
malized by the variance of the loadings. These two 
approaches are proposed for different situations: the 
dimension reduction approach for when there is cor- 
relation between markers and the analyst has access 
to the original data, and the summary-statistic test for 
when the analyst only has access to Z-statistics from 
univariate regressions of the outcome on the marker 
along with an estimate of the correlation matrix of 
the markers in the population from which they come. 
Moskvina et al. (2012) also propose solutions, one of 
which is based on Hotellings T 2 test, while another is 
based on multivariate logistic regression, though con- 
cludes that both perform similarly. We compare these 
various approaches under different structures of LD and 
effect size. We apply our methods to a case-control 
sequence data set of oral cleft and an already-published 
G WAS study of Chronic Obstructive Pulmonary Disorder 
(COPD) [10]. 
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Methods 

Description of permutation-based gene tests 

First we show how power differs for permutation-based 
gene tests as a function of linkage disequilibrium from a 
theoretical perspective. When we refer to permutation- 
based gene tests, we mean gene-based tests in which the 
sum of the x 2 statistics for markers is taken and then an 
empirical p-value is calculated by permuting case-control 
status to generate a null distribution. By Imhof (1961), in 
connection with Liu et al. (2010), we know that the null 
distribution of the permutation-based test is Yll=i ^iXi> 
where X\ ls a chi-squared random variable on 1 d.f., ki is 
the i th eigenvalue of E, the q x q correlation matrix of the 
SNPs comprising the gene to be tested. Under the alter- 
native, the distribution is approximately 5Zf =1 kiX± (<$?), 
where the non-centrality parameter 5/ is calculated 
Si = v\ • [il^fku where V; is the eigenvector of £ corre- 
sponding to Xi, and fi is the q-dimensional mean of the 
multivariate normal distribution of Z-statistics calculated 
for univariate regressions of each SNP. ft is a function 
of the power for detection of each SNP in the gene. 
The distribution under the alternative is approximately 
5Zf=i ^iX\ (&f)> rather than exactly, because the correla- 
tion matrix of marker Z-statistics coming from univariate 
regressions diverges from the correlation structure of the 
covariates when under the alternative. However, so long 
as there is not significant variation between observations 
in the true probability of being a case, this divergence 
will not be relevant. Since the relative risk of disease con- 
ferred by most minor alleles is small, it is likely that the 
approximation is valid in most studies. 

Suppose there is a single causal SNP X\ and, without loss 
of generality, it is the first entry in the (q + 2) -marker gene 
and q other SNPs, Z/, 1 < i < q } are correlated with it but 
do not cause the outcome. Also assume that the correla- 
tion coefficient between X\ and Z{ is pu and the last SNP, 
X<i, is uncorrected with X\ and does not cause the out- 
come. The first entry of ft, which represents the mean of 
the Z-statistic for Xi, can be written k\ • +Jn for some num- 
ber k\ where n is the sample size. Since the asymptotic 
relative efficiency for using Z; rather than X\ is pf, the i th 
entry of fi, that associated with Z/, can be written k\ • pv^/n 
[11,12]. The entries of /xi, the (q-\- 2) -dimensional mean of 
the Z-statistics corresponding to the permutation-based 
gene test when the causal SNP is X±, are 

fii T ' = (kiy/n, h-piy/n, h- p 2 y/n 1 . . . , kvp q y/n, 0). 

In contrast, suppose that the correlation structure 
among SNPs is the same, that is, Cor(Xi,Z/) = pi for 
1 < i < q, but X\ does not cause the outcome, and instead 
Xii the SNP uncorrected with all other (q + 1) markers, 
causes the outcome and to the same degree as X\ did so in 



the previous scenario. Then /X2, the (q + 2) -dimensional 
mean in this case is 

fi 2 T = (0, 0, 0,..., 0, kiy/n). 

If Q = 5Zf=i kiXi($j)> tne P ower °f an s i ze a (i- e -> tyP e 
1 error rate) permutation-based gene test is P(Q > c*), 
where c* is the (1 — a) quantile of the random vari- 
able Y^i=\ ^iXi- The intuition behind the power gains for 
causal SNPs in regions of LD is that the non-centrality 
parameters 8[ will generally be larger when the causal SNP 
is in a region of LD than when it is not. Providing greater 
rigor than this intuition is difficult because the calculation 
of Si for all 1 < i < (q + 2), even in the simple case of 
pi = pj for all 1 < i < q, can be complicated. However, 
sampling from the appropriate distributions demonstrates 
that there is greater power to detect a gene-outcome asso- 
ciation when the causal SNP is in a region of LD. Figure 1 
shows that under the alternative of gene-outcome asso- 
ciation and for a fixed effect size and pi = pj = p for 
1 < i < q, i.e., the (q + 2) x (q + 2) correlation matrix £ is 



1 p p 
pip 
p p 1 



p 0 
p 0 
p 0 



p p p 

0 0 0 



1 0 
0 1 



the distribution of the test statistic Q when the causal SNP 
in the gene is in a region of LD is stochastically greater 
than the distribution when the causal SNP in the gene 




test stat when causal SNP not in LD 

Figure 1 Q-Q plot of permutation-based gene test statistic 
under alternative for correlated region versus uncorrelated 
region. The higher the correlation in the LD block containing the 
causal SNP, the more power relative to the causal SNP not in the LD 
block using the permutation-based test. Lines are labelled with the 
correlation coefficient for SNPs in the block. 
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is not in a region of LD. As a result, power is greater 
to detect a gene whose causal SNP is in a region of LD 
for a test of any size a in a permutation-based gene test. 
Figure 1 was generated assuming a gene consisting of 7 
markers, where 6 markers were correlated with coefficient 
p, shown for values 0.2, 0.5, and 0.8 in the figure, with the 
causal SNP in the LD block (y-axis) versus not in the LD 
block (x-axis). While the example may seem contrived, if 
we consider q = 10 so that our gene consists of 12 SNPs 
in total, the correlation structure in this example is simi- 
lar to choosing the first 12 SNPs of BRCA1 [13], in which 
case p ^ 0.96, and where the last row and column above 
would be approximately 0.24 rather than 0. As a result, 
the variation in power observed when using permutation- 
based tests may have real-world import on the groupings 
of SNPs able to be detected. 

Figures 2 and 3 demonstrate from a graphical perspec- 
tive how permutation-based gene tests can have variable 
power as the LD structure changes. The figures illustrate 
that there is more power for the permutation-based gene 
test when causal SNPs are in high LD blocks as com- 
pared to causal SNPs in low LD blocks. Additionally, if a 
causal SNP is not in LD with other SNPs, but large LD 
blocks exist in the gene, power for the permutation-based 
gene test decreases as the size of the block increases. Data 
were generated with a minor allele frequency of all SNPs 
of approximately 0.3 with Hardy- Weinberg equilibrium 
assumed, and, within the LD block, correlation between 
SNPs was approximately 0.65, whereas SNPs not in the 
LD block were independent of one another. The gene con- 
sisted of 20 SNPs, and there were 600 subjects with an 




0.4 0.6 0.8 

Effect Size (Log OR) 

Figure 2 Power for permutation-based gene test as a function of 
causal SNP's location relative to correlated region and effect 
size. There is more power to detect a SNP in high LD with other, 
non-causal SNPs (solid, black line), than a SNP in low LD (dashed, blue 
line) for an identical effect size. 



o 
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Size of LD block 

Figure 3 Power for permutation-based gene test as a function of 
causal SNP's location relative to correlated region and size of LD 
block. For a constant effect size, size of the LD block in which the 
causal SNP is (or is not) located is related to the power for detection. 
The black, solid power curve refers to the permutation-based gene 
test when the causal SNP is in the LD block, and the blue, dashed 
power curve refers to the test when the causal SNP is not in the LD 
block. 



equal number of cases and controls. Power calculations 
were based on 1000 iterations at each effect size (Figure 2) 
or LD block size (Figure 3). We calculated power at 18 dif- 
ferent effect sizes (Figure 2), with the effect size ranging 
from a log odds ratio (OR) of 0 to 1.2, and 20 different 
LD block sizes (Figure 3), with the LD block size rang- 
ing from 1 SNP to 20 SNPs. So when the size was 20 
SNPs, the LD block was the entirety of our hypothetical 
gene (Figure 3). Binary outcomes were generated assum- 
ing a logistic regression model, where the presence of the 
causal SNP determined probability of being a case or con- 
trol. Simulations assumed a single causal marker to clearly 
illustrate how power changes as a function of location of 
the SNP within the LD structure, though as the theoreti- 
cal work above shows, the result holds for any number of 
causal markers. 

As described in the Additional file 1 section of the 
manuscript, the authors have posted an R script online 
that allows one to see power variation for permutation- 
based gene tests as a function of the correlation struc- 
ture, power to detect the causal marker (in a univariate 
regression), and its location relative to the other SNPs 
comprising the gene. 

Description of summary statistic based test 

We first describe a simple solution to the problem of 
how LD structure can affect the power to detect genomic 
regions in which there are significant SNPs. Our solution 
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is based on the Z-statistics associated with each marker 
and the correlation matrix of the SNPs. Since we propose 
this test as one that can be used without a full data set, 
we propose a modification of it in case the true corre- 
lation structure is not perfectly known or it is believed 
that study participants are not reflective of the population 
from which the correlation of SNPs are calculated (such as 
with HapMap reference panels). 

One solution to the problem of under-detection of SNPs 
in low LD posed by permutation tests is transformation 
of the gene-based test statistic so that under the null it 
follows a standard parametric distribution, rather than 
creating a non-standard null distribution through per- 
mutations. One way to accomplish this task, and one in 
which it is unnecessary to reanalyze data, is to perform 
a joint test on the Z-statistics coming from a univariate 
regression model for each marker. It is a an approach sim- 
ilar to that described by Yang et al. (2012), though uses 
summary statistics directly rather than estimated model 
coefficients. Since the estimated covariance structure of 
these statistics under the null is the correlation of the 
markers themselves [14], one can use the data to estimate 
the covariation of the Z-statistics or an online database of 
LD or correlation structure of SNPs. The intuition behind 
this result is that if two markers are highly correlated, 
then when by chance under the null, one marker is sig- 
nificant (or insignificant), the other marker will similarly 
be significant (or insignificant). However, if two mark- 
ers are not correlated, then the chance significance or 
insignificance of one marker will not inform the signif- 
icance of the other marker. And since Z-statistics have 
variance 1 by definition, their covariance matrix is iden- 
tical to their correlation matrix. Thus, supposing we have 
q markers, which, from previous studies, are known to 
have Z-statistics of Z = (Zi, . . .,Z q ) T , and which have 
correlation structure V, then under the null hypothesis 
of no marker being associated with the outcome, T = 
Z T V~ 1 Z ~ Xq. One then rejects the null of no associa- 
tion between the region composing the q markers and the 
outcome for an extreme value of the test statistic T using 
a pre-determined a level. 

If one is not confident that V accurately reflects the cor- 
relation of the SNPs in the data matrix and therefore Z 
under the null hypothesis (because V is an estimate possi- 
bly coming from a reference panel if the original data is not 
available and the analysis is performed only with access to 
summary Z-statistics), it is possible to construct a more 
conservative test by shrinking the off-diagonal elements of 
V towards 0. Thus, if V is an estimate of the covariance of 
the SNPs, one can compute V* = yV + (1 — y)l q , where 
Iq is a q x q identity matrix and 0 < y < 1. 

Again using [15], if Z ~ MVN(0, V) but we use V* as an 
estimate of the correlation structure in the gene based test, 
then Z T V*~ l Z ~ J2l=i ' Xi where q is the dimension of 



vector Z, and where X/ is the i th eigenvalue of VV* 1 . By 
construction of V* -1 , Y^=i < dim(Z) for 0 < y < 1, 
where dim(-) the dimension of the vector argument. This 
fact in itself does not not necessarily imply a more con- 
servative test for all size a tests because when eigenvalues 
are not equal to one another as is the case for the decom- 
position of VV*' 1 with V ^ V*- 1 , YlLi x i < dim(Z) 
can be true, but Z T V*~ X Z is not stochastically less than 

XdimCZ)' tne nu ^ distribution of the test statistic when 
the correlation structure is correctly known. However, for 
modest values of y (i.e., 0.8-1.0, where 1.0 corresponds to 
no transformation), the test using the adjusted correlation 
matrix will generally be more conservative. It is difficult to 
obtain simple solutions for how much conservative a test 
will be using this modification since it will depend on the 
quantile corresponding to the intended type 1 error and 
the specific VV*- 1 . 

Thus, to give a practical sense of useful values of y, we 
borrowed a correlation structure of eight SNPs in the INS- 
IGF2 gene of Chromosome 11 from the CEU reference 
panel in one case and the CHB+JPT reference panel in the 
other case [13]. If the true, underlying population giving 
rise to the SNPs was more reflective of the CEU reference 
panel, but the analyst incorrectly guessed the correlation 
structure to be that of the CHB+JPT reference panel when 
performing the summary statistic gene-based test,the type 
1 error rate for a nominal 0.05 size test would in fact 
be a highly inflated 0.61. Similarly, if the true, underlying 
population giving rise to the SNPs was more reflective of 
the CHB+JPT reference panel, but the analyst incorrectly 
guessed the correlation structure to be that of the CEU 
reference panel for the summary statistic gene-based test, 
the type 1 error rate for a nominal 0.05 size test would be 
0.69. If the type 1 error is inflated in one scenario, there 
is no implication that it will be deflated in the 'inverse' 
scenario. 

In the scenario where the underlying population was 
more reflective of the CEU panel, the type 1 error using 
our modified summary statistic test with adjusted correla- 
tion matrix and y's of 0.9, 0.5, and 0.3 led to reduced error 
rates of 0.36, 0.11, and 0.09, respectively, instead of 0.61. 
When the underlying population was CHB+JPT, the type 
1 error using our adjustment correlation matrix and y's of 
0.9, 0.5, and 0.3 led to reduced error rates of 0.45, 0.10, and 
0.07, respectively, instead of 0.69. 

Type 1 error rates were calculated by sampling from an 
eight-dimensional multivariate normal distribution with 
mean vector 0 (i.e., under the null) and unit variance for all 
elements, but whose correlation structure corresponded 
to the "true" correlation structure of the population in any 
one of these two situations described, be it that of the 
CEU or CHB+JPT panel. We then multiplied each sample, 
Zi, ... Z n , with n = 2000, by the estimated correlation 
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matrix, V y , by taking ZjV~ l Zi, where V was taken from 
the panel unreflective of the truth (e.g., the CEU correla- 
tion structure was used for an underlying, true correlation 
structure corresponding to the CHB+JPT panel and vice 
versa) and possibly modified with the y parameter. To find 
the error rate, we calculated the probability mass beyond 
the 0.95 quantile of x|, the distribution ZjV~ l Zi would 
follow if Var{Z{) = V y (i.e., assuming the true correlation 
of Zi was known). 

While in all cases, the nominal size of the test is not 
quite achieved, type 1 error is greatly reduced, and in some 
cases will be achieved when divergence between correla- 
tion structures of the true and hypothesized populations 
are not as great as that in these scenarios. The greatest 
reduction in type 1 error occurs with initial deviation of 
y from 1; i.e., a movement of y from 1 (indicating an 
unadjusted correlation matrix) to 0.9 will reduce type 1 
error more than a movement of 0.6 to 0.5. And, as men- 
tioned, with very small values of y, there is not necessarily 
a guarantee of continued reduction in type 1 error for 
some nominal a level tests, nor should such y values 
be used if indicative of no confidence in ones estimated 
correlation matrix. 

To simulate a less drastic divergence between true and 
estimated correlation matrices and assess error rates and 
the proposed adjustment method in that context, we 
generated correlation matrices whose entries were beta- 
distributed random variables with means corresponding 
to the entries in the CHB+JPT reference panel and stan- 
dard deviation approximately 0.03-0.04 (approximately 
because standard deviation is partly a function of the 
mean). With a population whose underlying correlation 
structure was in truth reflective of the CHB+JPT panel, 
but using the generated correlation matrices in our cal- 
culations of the test statistic, the average type 1 error 
rate was 0.19. Adjusting the generated correlation matri- 
ces according to our method and with a y value of 0.95, 
the error rate was reduced to 0.05. Adjustment of the gen- 
erated correlation matrices with a y value of 0.90 led to a 
type 1 error rate of 0.03. 

The summary statistic based test we have proposed is 
a viable way of performing gene-based testing when one 
does not want power to vary as a function of the cor- 
relation structure of the SNPs composing the gene. A 
weakness of such an approach is an inability to know 
the underlying correlation structure of the SNPs used 
in the univariate regression analyses giving rise to the 
Z-statistics used in the summary statistic test. We have 
shown that incorrect guesses of the underlying correla- 
tion structure can lead to a significant increase in the 
type 1 error rate and therefore have proposed an adjust- 
ment method which can lead to achievement of error rates 
in line with the nominal size of the test. However, since 
by supposition of this setting the correlation structure 



of SNPs is never known, it is impossible to know the 
needed value of y. As a result, it may be best to perform 
ones summary statistic based test with y values rang- 
ing from 0.8-1.0 as a sensitivity analysis to see how one's 
conclusions change based on different values. Values of y 
smaller than 0.8 probably reflect little confidence in the 
estimated correlation structure, in which case feasibility of 
the analysis in the first place should be reassessed. 

Description of Eigen decomposition-based test 

The above approach controls for the LD structure of the 
region under study by transforming the test statistic so 
that LD no longer affects the power to detect significant 
regions. However, there are other ways one can make use 
of the LD structure to construct more powerful tests, such 
as by dimension reduction. Consider an extreme example 
where an investigator is interested in a region with d 
SNPs, and these SNPs are in nearly perfect LD so that a 
correlation matrix of them has off-diagonal entries close 
to 1. Because they are highly correlated, the association 
between any SNP and the outcome adds little information 
on top of that between any other SNP and the outcome. As 
a result, intuition may tell us that using a d-d.f. test on the 
region after having properly accounted for the underly- 
ing LD structure is not the most powerful approach since 
there is essentially the information of 1 SNP contained in 
the entire region. On the other hand, it is difficult to jus- 
tify focusing on any one SNP over another as one might do 
when "tagging" the region. Also, while no additional SNP 
contributes much information over another, there is still 
some amount of additional information contained in each 
one that, ideally, would not be ignored. 

Finding the eigenvectors and values of the information 
matrix is one way to approach this scenario. Unlike the 
previously proposed summary statistic test, this analy- 
sis approach requires the original SNP data. It gleans the 
essential information from the LD block, thus stripping 
away extraneous information that dilutes the power of 
proposed tests while avoiding the arbitrariness of prun- 
ing the number of SNPs being examined. It is an approach 
similar to finding the principal components of the data 
matrix and then regressing the outcome on those com- 
ponents if the information matrix is evaluated under the 
null [2,3] and may even be thought of as a score test 
analogue to it. If certain covariates have been shown to 
control for population stratification, it also may be fitting 
for the matrix to be evaluated under the alternative using 
the estimated effect sizes of those covariates. Also, as sim- 
ulations demonstrate, there may be power gains under 
certain correlation structures or when effect or sample 
sizes are small. Since the information matrix is the covari- 
ation of the scores associated with each marker and since 
score functions of highly correlated markers are correlated 
as well, identifying the chief axes of the covarying scores is 
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synonymous with finding the eigenvectors of the informa- 
tion matrix. One can then detect small deviations from the 
mean under the null hypothesis, the 0 vector, by perform- 
ing a parsimonious test. Additionally, if we are considering 
the underlying model to be that of logistic regression, both 
the information matrix and the score have simple forms 
and are computationally tractable. 

We now describe how to construct the test, which will 
place no constraint on estimation of the intercept, but 
reduce the dimension of the covariates underlying the 
scores. The score function associated with the j th marker 
under a logistic regression model is 



While the (/, l) th entry in the information matrix is 
Iq+i (/',/) = Cov(S(Pj),S(pi)) = CovC^yiXip^yiXu) 

i i 

i 

exp(y^ w _ n fi m Xim) 



ex p(E* PkXik) 



— XjjXji - 



1 + exp(J2 m =0 PmXim)} 2 



where we could estimate P(yi = 1) under an intercept- 
only model, i.e., the proportion of cases, or another model 
that included potential confounders. For the sake of expla- 
nation, we will proceed as if using the intercept-only 
model. The information matrix for the logistic regression 
model is then 



Iq+i 



Iq+1 (1,1) Iq+1 (1,2) 
Iq+1 (2,1) Iq+1 (2,2) 

^+1(^+1,1) ^+1(^+1,2) 



Iq+1 (hq+l) 
Iq+1 (l f q+l) 

Iq+1 (q+l,q+l) _ 



where, as is consistent from our definition of Ijj, Iy = Ijj. 
Also, I is (q + 1) x (q + 1) because there are q markers and 
1 intercept available for use in the model. 

Now define I q to be the q x q information matrix for the 
covariates, not including the intercept. That is, I q is I q +\ 
without the first column and first row of I q +\. I q can be 
decomposed into E W E~ r , where E = (e\, • • • ,e q ) 
is a matrix of the q eigenvectors eu 1 < / < q, and 
W = diag(Ai, A.2 • • • t^q) is a diagonal matrix of the 
corresponding eigenvalues A.,-, 1 < i < q. While one 
can use a variable number of eigenvectors in the analy- 
sis, if we suppose that we are in the situation described 
above where all d markers are highly correlated, then mak- 
ing use of just the first component may be sufficient to 
adequately encompass the information contained in the 
genomic region. More generally, a systematic criterion for 
deciding which eigenvectors to use is employing all those 



whose associated eigenvalues are larger than the average 
eigenvalue. 

For the sake of explanation, we suppose first that we will 
construct the test using only the eigenvector associated 
withe largest eigenvalue and then generalize later. Denote 
e\ the first column of E and vector associated with the 
largest eigenvalue (assume the columns of E are ordered 
according to decreasing eigenvalue). The interpretation 
of e\ is the axis of maximum variation of the distribu- 
tion whose covariance matrix is I q , and X\, the associated 
eigenvalue, can be interpreted as the variation along that 
axis. Since I q is q x q, e± is a (q x 1) unit eigenvector. 
Define a new 2x2 information matrix /* as /*(i,i) = 
1^+1(1,1)^* (2,2) = and /*2,i = 1*1,2 = e\ T - I q (,i), 
where I q (,i) is the first column of I q , v T denotes the trans- 
pose of vector v, and e\ T • I q (,i) denotes the dot product 
of vectors e\ and Iq^i). The test statistic for the 1 d.f. 
score test analogue of the method described in [2,3] is 
then (S t • e\) 2 -[ (7* -1 )(2,2)L where S is the ^-dimensional 
vector of scores associated with the q markers, which fol- 
lows a Xi distribution under the null hypothesis of no 
gene-outcome association. 

To generalize the method to using p eigenvectors, sim- 
ilar to regressing the outcome on the first p principal 
components of the data matrix, again perform an Eigen 
decomposition of I q , and define ei,...,e p as the p unit 
eigenvectors of length q associated with the p largest 
eigenvalues. Call those associated eigenvalues X\, . . . ,X p . 
Define a new information matrix /** as /**(i,i) = 



I**(m,m) = k m (where 2 < m < p), I** (m , n ) = 
( W>m ) = 0 (where m ^ n and 2 < m,n < p), and 



M+i (i,D 



I**(m,i) = I**(i,m) = e m T - I q (,i) (where 1 < m < p 
and I q (,i) denotes the first column of I q Y). Note that the 
off-diagonals of /** which are neither the first row nor 
first column are zero by the orthogonality of eigenvectors; 
i.e., for 1 < m,n < p, m ^ n, I**{ m ,n) = I**{n,m) = 
el • I q • e m = e T m • I q • e n = 0. /** is (p + 1) x (p + 1) and 
looks as follows 



L q+\ (1,1) 



T la 



e\ T Iq(,i) e 2 T Iq(,i) ... e p 



e ^ 'Iq(,D 



e 2 ^(,1) 



' Iq Gl) 



0 



0 

X 2 



' Iq (4) 
0 

0 



Define I** p ^ p as the lower-right p x p sub-matrix 
of 7**" 1 . The test statistic is S T • (a . . . e p ) • P* p \ p • 
(ei ... e p ) T • S, which follows a x p distribution under the 
null hypothesis of no gene-outcome association, where 
again S is a vector of scores of length q. 

Oftentimes in GWAS, population stratification can 
obscure the relationship between markers (or groups of 
markers) and outcomes. In these settings, it is necessary 
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to account for stratification by fitting models with covari- 
ates or ancestry informative markers (AIM) that adjust for 
the different populations composing the sample. Reducing 
the dimension of such covariates along with the mark- 
ers making up the gene renders them less effective if not 
useless for their intended purpose of controlling for pop- 
ulation stratification. Thus, it is necessary to construct a 
score test where only a chosen subset of the covariates 
have their dimension reduced and the information matrix 
is evaluated under the alternative for those covariates 
whose dimension is not reduced. Doing so is not diffi- 
cult and only requires treatment of the adjusting covariate 
in the quasi-information matrix as we treated the inter- 
cept in /**, where the off-diagonal entries were a linear 
combination of the appropriate eigenvector and ^-length 
sub-column of the original information matrix. So sup- 
pose there are q markers and we only want to reduce the 
dimension of the last (q — 1) of this group. Let I( q +\) be 
the (q + 1) x (q + 1) information matrix and define I( q -i) 
as the lower right (q — 1) x (q — 1) sub-matrix of 
Decompose I{ q -\) into E f W f £'~\ where E f is the matrix 
of (q — 1) eigenvectors, e'- for 1 < ; < (q — 1), of I( q -i) 
and W f is the diagonal matrix of corresponding eigenval- 
ues, Ay for 1 < j < (q — 1), and we use E f and W to 
differentiate these matrices from those defined above and 
not to indicate the transpose of these matrices. Suppose 
we want to use the first p' eigenvectors for our test of the 
(q — 1) markers in the group whose dimension we reduce 
and where p' < (q — 1). The quasi-information matrix is 
defined 



J q+l (1,1) 
7 <7+l (2,1) 



2 q+l (1,2) 
J q+1 (2,2) 



LCD e i 



H2) 



HD 
H2) 



*q-i 02) 



Analogous to the test statistic defined in the previous 
test, define /*** ~, 1 , as the lower-right {p 1 x p') sub-matrix 

of and S f as the vector of scores associated with 

the (q — 1) markers whose dimension we reduce. The test 
statistic is S' T • (e\ . . . £?',) • , • (e\ ... e' ,) T • 

v 1 p' p'xp' v 1 p' 7 ' 

which follows a distribution under the null hypothesis 
of no gene-outcome association. 

While there was extended discussion of the type 1 error 
rate for the previous, summary statistic, test, and how it 
varies as a function of the shrinkage parameter, y, since 
V was considered possibly misspecified, there is no such 
necessary discussion of type 1 error for the Eigen decom- 
position test; since we assume perfectly known data for 
the Eigen decomposition test, it is a relatively standard, 
parametric, hypothesis test so that asymptotic results hold 
and type 1 error rates correspond to the nominal size of 
the test. 



Results and discussion 

Simulation results for summary statistic based test and 
comparison with Hotelling's T 2 

Moskvina et al. propose a test based on Hotellings T 2 . 
If one knows the true information matrix, it is a mul- 
tivariate score test and follows a X 2 distribution under 
the null. Supposing that there are q markers and S = 
(S(Pi), . . . ,S(f3 q )) T , the associated scores, and the true 
information matrix is /, then under the null of no marker 
being associated with the outcome, S T I~ 1 S ~ X 2 . Simi- 
larly, and as described above, the summary statistic based 
test uses the Z-statistics associated with univariate logistic 
regression models, Z, and the marker correlation matrix, 
V, so that under the null hypothesis and assuming V is 
perfectly known, T = Z T V~ 1 Z - X 2 . While both of these 
approaches use similar information (i.e., some measure of 
SNP significance not controlling for other SNPs and an 
estimate relating to the correlation of those measures), in 
simulation the summary statistic based approach seems 
to have slightly less power than the Hotellings T 2 test, 
but the difference is almost non-existent in many cases 
(Figure 4), and the summary statistic based test also 
seems to be more conservative, again assuming a perfectly 
known correlation structure V. Importantly, however, the 
summary statistic approach does not require the original, 



o 




0.2 0.4 0.6 0.8 1.0 

Effect Size (Log OR) 

Figure 4 Power comparison for a Hotelling's T 2 approach versus 
summary statistic test as a function of effect size. Power 
comparison of Moskvina et al.'s method (solid, black line) with our 
summary statistic based method (dashed, blue line) when the causal 
SNP is in a block of high LD. This graph looks identical to that when 
the causal SNP is not in a block of high LD because the power for 
both methods is not a function of the correlation of the SNPs and 
where causal SNPs are located relative to regions of correlation. Q-Q 
plot of permutation-based gene test statistic under alternative for 
correlated region versus uncorrelated region. 
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individual-level data, which is not the case with Hotellings 
T 2 . Power for the summary statistic based approach does 
not vary as a function of whether the causal SNP in the 
gene is in a region of LD or not (Figure 5), and, for a con- 
stant effect size, power does not vary as a function of the 
size of the LD block in the gene (Figure 6). 

Simulations were generated under the same frame- 
work as we used with the permutation test simulation 
above. Covariates were generated with a minor allele fre- 
quency of approximately 0.3, and, within any LD block, 
correlation between SNPs was again approximately 0.65, 
whereas SNPs not in the LD block were independent 
of one another. We assumed Hardy- Weinberg equilib- 
rium. The gene consisted of 20 SNPs, and there were 
600 subjects with an equal number of cases and con- 
trols. Power calculations were based on 1000 iterations 
at each effect size (e.g., Figure 5) or LD block size (e.g., 
Figure 6). Lastly, binary outcomes were generated assum- 
ing a logistic regression model, where presence of the 
causal SNP determined the probability of being a case or 
control. 

Simulation results for Eigen decomposition-based test 

We examine in simulation performance of the dimension- 
reduced score test when a single causal SNP was in an LD 
block and compare this proposed test with the method 
described in [2,3], in addition to 1 d.f. score and Wald tests 
of the causal SNP and a 1 d.f. Wald test of a tagging SNP. 
Figure 7 shows relative performance of these methods 
when there was no LD between SNPs in the gene, while 
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^ 1 1 1 

0 5 10 15 20 
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Figure 6 Power comparison for summary statistic test when 
causal SNP is in an LD block versus not in an LD block as a 
function of size of the LD block. Test working as desired since 
power is constant across LD block sizes for fixed effect size regardless 
of whether the causal SNP is in the LD block (solid, black line) or no in 
the LD block (dashed, blue line). 



Figure 8 compares methods when correlation was approx- 
imately 0.15 between any pair of SNPs in the gene. We see 
that the performance of the Eigen decomposition-based 
test performed better relative to the method proposed 
in [2,3] when the LD block was more weakly correlated. 
As the correlation increases, power of these two methods 
converges. Direct testing of the causal SNP, be it through a 
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0.0 0.2 0.4 0.6 0.8 1.0 1.2 

Effect Size (Log OR) 

Figure 5 Power comparison for summary statistic test when 
causal SNP is in an LD block versus not in an LD block. A SNP in 

high LD with other, non-causal SNPs (solid, black line), has no more 
power to be detected with the summary statistic test than a SNP in 
low LD (dashed, blue line). 




0.0 0.2 0.4 0.6 0.8 



effect size 

Figure 7 Power comparison for different dimension-reduction 
tests when there is no correlation in gene. Power comparison 
between the Eigen-based test (red, dashed line) and Gauderman's 
method (solid, black line), along with a direct test of the causal SNPs 
(Wald and Score test lines are purple and blue, dashed, on top of one 
another) and tagging SNP under no LD (light blue, dashed line). 
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Figure 8 Power comparison for different dimension-reduction 
tests when there is moderate correlation {p w 0.25) in gene. 

Power comparison between the Eigen-based test (red, dashed line) 
and Gauderman's method (solid, black line underneath line referring 
to the Eigen test), along with a direct test of the causal SNPs (Wald 
and Score test lines are purple and blue, dashed, on top of one 
another) and tagging SNP under high LD (light blue, dashed line). 



Wald or score test, performed best as expected, though of 
course knowledge of the true causal SNP is generally never 
known. Thus, we note that the Eigen decomposition- 
based test performs better than testing of a tagging SNP 
and makes unnecessary the need to decide which tag- 
ging SNP to use. In Figure 8, even under weak LD, the 
Eigen decomposition-based test pays little price in terms 
of power for no knowledge of the true causal SNP. 

The simulation framework for the Eigen decomposition 
test differed slightly from that of the permutation-based 
test and the summary statistic test. The gene consisted of 
15 SNPs, and there were 800 subjects with an equal num- 
ber of cases and controls. Power calculations were based 
on 2000 iterations at each of effect size (Figures 7 and 8). 
We calculated power at 11 different effect sizes, with the 
effect size ranging from a log OR of 0 to 1.0. As described 
in the previous paragraph, correlation scenarios varied 
between independence of SNPs (Figure 7) to mutual cor- 
relation of SNPs (Figure 8). Minor allele frequency again 
fell in the range of "common" variants at 0.3 with Hardy- 
Weinberg equilibrium assumed. Binary outcomes were 
generated assuming a logistic regression model, and pres- 
ence of the causal SNP and the effect size determined 
probability of being a case or control. 

Data analysis of oral cleft sequence data and COPD 
summary statistics 

We analyze a sequence data set composed of 192 cases 
of cleft lip and 192 controls, on whom we have data for 



14 SNPs. The data come from a GWAS in which a can- 
didate gene was identified and then sequenced [16]. We 
prune the data set so that any observations with miss- 
ing values or deletions are excluded, giving 172 cases and 
176 controls. We also prune SNPs so that any SNP with 
a MAF less than 0.02 among either cases or controls is 
excluded, leaving 8 SNPs. We calculate the correlation 
matrix of SNPs by pooling cases and controls. Using the 
summary statistic based test, we find that the region com- 
posed of the 8 SNPs is associated with cleft lip (p=0.06). 
Using the Eigen decomposition based test with the two 
eigenvectors whose associated eigenvalue is bigger than 
the average eigenvalue, we calculate a p- value of 0.017; 
using 3 eigenvectors such that more than 80% of the varia- 
tion in scores is explained, we calculate a p-value of 0.016. 
Thus, as is consistent with the potential power gains posed 
by dimension reduction, this latter test shows a stronger 
association between the region of 8 SNPs and cleft lip. 
For comparison, we also calculated a permutation test p- 
value, giving 0.008 (5000 permutations), and a Hotellings 
T 2 p-value, giving 0.056 (non-parametric, permutation- 
based p-value for this test gives 0.057). Assuming little 
correlation among SNPs, one would expect the permuta- 
tion test p-value to give a p-value similar to that of the 
summary statistic based test. The greater significance of 
the permutation test p-value suggests that a significant 
SNP is in LD with other SNPs and examination of the 
data matrix confirms this idea; a SNP whose p-value is 
0.018 using a univariate logistic regression model is highly 
correlated with one SNP (r=0.71) and moderately corre- 
lated with another SNP (r=0.43). Since only 8 SNPs are 
being analyzed, these two SNPs in LD with the significant 
SNP may be driving the significance of the permutation 
test. 

We also apply the summary-statistic based test to results 
borrowed from an already-published GWAS along with 
information on the correlation of markers taken from 
HapMap [10,13]. Pillai et al. (2009) identified 5 SNPs in 
the CDKAL1 gene on Chromosome 6 to be associated 
with Chronic Obstructive Pulmonary Disorder (COPD). 
We run our summary statistic based test on their results. 
Since the results come from a study of Norwegians, we 
use the (CEU) reference panel from HapMap as an esti- 
mate of the correlation structure of SNPs. The underlying 
population is unlikely to be identical, however, and so we 
adjust the correlation matrix, shrinking the off-diagonal 
elements toward 0 as described in the modification of the 
summary statistic based test to preserve the type 1 error 
rate. We do so with y values of 1 (i.e., assuming the corre- 
lation structure is correct), 0.9, and 0.8, and corresponding 
p-values for the 5 d.f. test are 0.0066, 0.003, and 0.001. 
While the summary statistics we use are borrowed from 
the 100 most significant SNPs of their analysis [10], the 
high level of significance for tests corresponding to all y 
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values and non-arbitrary choice all SNPs in the chosen 
gene suggest that there is likely some association between 
the CDKAL1 gene and COPD. Since the test statistics are 
themselves random variables, specific realizations of them 
are not necessarily associated with increasing p-values as 
one would anticipate with decreases in y. However, work 
above has shown that, in general, modest decreases in y 
should help preserve type 1 error. 

Conclusion 

With the availability of sequence data and GWAS, the 
importance of statistical analysis is shifting from single- 
locus tests to multi-loci tests that can cover genomic 
regions, e.g., genes or even pathways. The motivation for 
this development is to test a hypothesis more grounded 
in biology and, at the same time, to reduce the multiple 
testing problem and allow for many SNPs with a small 
effect size to increase the power of the test by their 
combined inclusion in the model. One of the theoreti- 
cal issues that has so far not been addressed adequately 
is the impact of LD on the power of the test statistic 
in permutation-based gene tests. Controlling for the LD 
between loci is important to assess the relative importance 
of the different regions that are tested, especially when 
LD heterogeneity between regions is significant. In this 
paper, we have proposed 2 approaches that address this 
issue. 

While our summary statistic based test may give one 
similar results to a Hotellings T 2 based test, the sum- 
mary statistic test does not require the original marker 
data from which Z-statistics are calculated. This unique 
advantage opens up the possibility for more in-depth anal- 
ysis of previously published studies, and, with sufficient 
methodological development, could even suggest sum- 
mary statistic based pathway analyses when combined 
with summary statistics from expression analyses. It also 
opens up the possibility of cross-study gene-based tests, 
where Z-statistics from the same markers are combined 
across previously published GWAS to reap power gains. 
A shortcoming of our summary statistic based test is that 
if the estimated correlation structure used in the test is 
not reflective of the underlying population, the test may 
suffer from inflated type 1 error. We therefore proposed 
an modification of the test by adjusting the estimated 
correlation matrix, which, in general, should help con- 
trol the error rate. If there is insufficient justification 
for why the estimated correlation matrix is representa- 
tive of the underlying population, the test should not be 
used even with correlation matrix adjustment. If one has 
the original SNP data, one can perform the dimension- 
reduced score test proposed in this paper, reducing 
the dimension to the number of eigenvectors that 
explains some pre-determined proportion of variation in 
the data. 



Both of the proposed gene-based tests in this paper 
fail to describe the direction of association between the 
gene and outcome, instead describing only significance of 
association. Direction of association is a difficult concept 
to interpret when a gene is composed of multiple SNPs, 
with some alleles protective and others a risk factor for 
the outcome. One goal in gene-based testing might be to 
gain an understanding of such a concept. Additionally and 
with regard to dimension reduction approaches, if alleles 
in a dimension-reduced block of SNPs are both protec- 
tive and harmful, there could be a loss of power using a 
dimension-reduction gene-based test. A test that used a 
priori analyses to decide whether alleles are protective or 
harmful and, in turn, used that information to inform the 
dimension reduction process might be another valuable 
area of research in gene-based testing. 

Additional file 



Additional file 1 : R script for test implementation and checking 
power variation for a permutation-based test. An R script 
implementing the summary statistic test with correlation matrix 
modification can be found at http://db.tt/bZz59KNO. Additionally, the 
script includes a function with which researchers can see the variation in 
power of a permutation-based gene-test as a function of placement of the 
causal marker for a correlation matrix of their choosing. It allows the 
researcher to see if there would be significant variation so that previously 
implemented permutation-based tests can either be re-performed using 
alternative methods or more confidence can be placed in already 
implemented tests. The code is commented with a description of function 
arguments and intended use. 
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